IRIX Base Documentation 1998 November


Text File | 1998-10-20 | 22KB | 463 lines

PERFEX(1)                                                            PERFEX(1)

NAME
     perfex - a command line interface to R10000 counters

SYNOPSIS
     perfex [-a | -e event0 [-e event1]] [-mp | -s] [-x] [-y] [-t]
            [-o <file>] [-c <file>] [-l <nn>] command

DESCRIPTION
     The given command is executed; after it is complete, perfex prints the
     values of various hardware performance counters.  The counts returned
     are aggregated over all processes which are descendants of the target
     command, as long as their parent process controls the child through
     wait (see wait(2)).

     The integers event0 and event1 index this table:

           0 = Cycles
           1 = Issued instructions
           2 = Issued loads
           3 = Issued stores
           4 = Issued store conditionals
           5 = Failed store conditionals
           6 = Decoded branches
           7 = Quadwords written back from scache
           8 = Correctable scache data array ECC errors
           9 = Primary instruction cache misses
          10 = Secondary instruction cache misses
          11 = Instruction misprediction from scache way prediction table
          12 = External interventions
          13 = External invalidations
          14 = Virtual coherency conditions
          15 = Graduated instructions
          16 = Cycles
          17 = Graduated instructions
          18 = Graduated loads
          19 = Graduated stores
          20 = Graduated store conditionals
          21 = Graduated floating point instructions
          22 = Quadwords written back from primary data cache
          23 = TLB misses
          24 = Mispredicted branches
          25 = Primary data cache misses
          26 = Secondary data cache misses
          27 = Data misprediction from scache way prediction table
          28 = External intervention hits in scache
          29 = External invalidation hits in scache
          30 = Store/prefetch exclusive to clean block in scache
          31 = Store/prefetch exclusive to shared block in scache
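
     The event table has 32 entries, and this page notes that two events
     which must share a hardware counter cannot be requested together.  The
     sketch below screens a pair of event numbers before they are passed to
     -e.  It assumes that events 0-15 belong to one counter and events 16-31
     to the other; that split is not stated on this page (see
     r10k_counters(5)), and the function names are hypothetical.

```python
# Sketch only: assumes events 0-15 are tied to hardware counter 0 and
# events 16-31 to hardware counter 1 (an assumption, see r10k_counters(5)).
EVENTS_PER_COUNTER = 16


def counter_for(event: int) -> int:
    """Return the assumed hardware counter (0 or 1) for an event number."""
    if not 0 <= event < 2 * EVENTS_PER_COUNTER:
        raise ValueError(f"event must be in 0..31, got {event}")
    return event // EVENTS_PER_COUNTER


def conflicts(event0: int, event1: int) -> bool:
    """True if both events would need the same assumed counter."""
    return counter_for(event0) == counter_for(event1)


print(conflicts(26, 10))  # False: one event from each half of the table
print(conflicts(25, 26))  # True: both fall in the 16-31 half
```

     Cycles and graduated instructions appear in both halves of the table;
     this sketch does not model any special handling of those events.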

BASIC OPTIONS
     -e event
          Specify an event to be counted.  Zero, one, or two event
          specifiers may be given; by default, cycles are counted.  Events
          may also be specified by setting one or both of the environment
          variables T5_EVENT0 and T5_EVENT1.  Command line event specifiers,
          if present, override these.  The order of events specified is not
          important.  The counts, together with an event description, are
          written to stderr unless redirected with the -o option.  Two
          events which must be counted on the same hardware counter (see
          r10k_counters(5)) will cause a conflicting counters error.

     -a   Multiplex over all events, projecting totals.  Ignore event
          specifiers.

          The option -a produces counts for all events by multiplexing over
          16 events per counter.  The OS does the switching round robin at
          clock interrupt boundaries.  The resulting counts are normalized
          by multiplying by 16 to give an estimate of the values they would
          have had for exclusive counting.  Due to the equal-time nature of
          the multiplexing, it is true with high probability that any events
          present in large enough numbers to contribute significantly to the
          execution time will be fairly represented.  Events concentrated in
          a few short regions (say, icache misses) may not be projected very
          accurately.

     -mp  Report per-thread counts for mp programs as well as (default)
          totals.

          By default perfex aggregates the counts of all the child threads
          and reports this number for each selected event.  The -mp option
          causes the counters for each thread to be collected at thread exit
          time and printed out, followed by the counts aggregated across all
          threads.  The per-thread counts are labeled by pid.
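
     The normalization described for the -a option can be illustrated with a
     few lines of arithmetic.  This is a sketch of the stated rule (multiply
     each multiplexed count by 16), not of how perfex is implemented; the
     raw counts are made-up numbers and the names are hypothetical.

```python
# Each of the 16 events on a counter is sampled for roughly 1/16 of the
# run, so the raw count is scaled by 16 to estimate what exclusive
# counting would have reported.
EVENTS_PER_COUNTER = 16


def project(raw_count: int) -> int:
    """Scale a multiplexed count up to a projected exclusive count."""
    return raw_count * EVENTS_PER_COUNTER


# Hypothetical raw counts keyed by event number (illustrative values only).
raw = {25: 1_200, 26: 75}
projected = {event: project(n) for event, n in raw.items()}
print(projected)  # {25: 19200, 26: 1200}
```

     The projection is only as good as the sampling: an event that occurs in
     a short burst may be over- or under-represented in its time slice.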

     -o <file>
          Redirect perfex output to the specified file.

     -s   Start (stop) counting on receipt of a SIGUSR1 (SIGUSR2) signal by
          the perfex process.

          This option causes perfex to wait until it (i.e. the perfex
          process) receives a SIGUSR1 before it starts counting (for the
          child process, the target).  It will stop counting if it receives
          a SIGUSR2.  Repeated cycles of this will aggregate counts.  If no
          SIGUSR2 is received, the counting will continue until the child
          exits (a normal case).  Note that counting for descendants of the
          child will not be affected.  Thus counting for mp programs cannot
          be controlled with this option.

     -x   Count at exception level (as well as the default user level).
          Exception level includes time spent on behalf of the user during,
          e.g., TLB refill exceptions.  Other counting modes (kernel,
          supervisor) are available through the OS ioctl interface (see
          r10k_counters(5)).

     To collect instruction and data scache miss counts for a program
     normally executed by

          % bar < bar.in > bar.out

     use

          % perfex -e 26 -e 10 bar < bar.in > bar.out

COST ESTIMATE OPTIONS
     -y   Report statistics and ranges of estimated times per event.

          Without the -y option, perfex reports the counts recorded by the
          R10000 event counters for the events requested.  As these are
          simply raw counts, it is difficult to know by inspection which
          events are responsible for significant portions of the job's run
          time.  The -y option associates an approximate time cost with some
          of the event counts.  The reported times are approximate.
          Due to the superscalar nature of the R10000, and its ability to
          hide latency, one cannot state a precise cost for a single
          occurrence of many of the events.  Cache misses, for example, can
          be overlapped with other operations, so there is a wide range of
          times possible for any cache miss.

          To account for the fact that the cost of many events cannot be
          known precisely, perfex -y reports a range of time costs for each
          event.  "Maximum," "minimum," and "typical" time costs are
          reported.  Each is obtained by consulting an internal table which
          holds the "maximum," "minimum," and "typical" costs for each
          event, and multiplying this cost by the count for the event.
          Event costs are usually measured in terms of machine cycles, and
          so the cost of an event generally depends on the clock speed of
          the processor, which is also reported in the output.

          The "maximum" value contained in the table corresponds to the
          worst case cost of a single occurrence of the event.  Sometimes
          this can be a very pessimistic estimate.  For example, the maximum
          cost for graduated floating point instructions assumes that all
          such instructions are double precision reciprocal square roots,
          since that is the most costly R10000 floating point instruction.

          Due to the latency-hiding capabilities of the R10000, the
          "minimum" cost of virtually any event could be zero since most
          events can be overlapped with other operations.  To avoid simply
          reporting minimum costs of 0, which would be of no practical use,
          the "minimum" time reported by perfex -y corresponds to the best
          case cost of a single occurrence of the event.  The "best case"
          cost is obtained by running the maximum number of simultaneous
          occurrences of that event and averaging the cost.
          For example, two floating point instructions can complete per
          cycle, so the best case cost is 0.5 cycles per floating point
          instruction.

          The "typical" cost falls somewhere between "minimum" and "maximum"
          and is meant to correspond to the cost one would expect to see in
          average programs.  For example, to measure the "typical" cost of a
          cache miss, stride-1 accesses to an array too big to fit in cache
          were timed and the number of cache misses generated was counted.
          The same number of stride-1 accesses to an in-cache array were
          then timed.  The difference in times corresponds to the cost of
          the cache misses, and this was used to calculate the average cost
          of a cache miss.  This "typical" cost is lower than the worst case
          in which each cache miss cannot be overlapped, and it is higher
          than the best case in which several independent, and hence
          overlapping, cache misses are generated.  (Note that on Origin
          systems, this methodology yields the time for L2 cache misses to
          local memory only.)

          Naturally, these "typical" costs are somewhat arbitrary.  If they
          do not seem right for the application being measured with perfex,
          they can be replaced by user-supplied values.  See the -c option
          below.

          perfex -y prints the event counts and associated cost estimates
          sorted from most costly to least costly.  While resembling a
          profiling output, this is not a true profile.  The event costs
          reported are only estimates.  Furthermore, since events do overlap
          with each other, the sum of the estimated times will usually
          exceed the program's run time.  This output should only be used to
          identify which events are responsible for significant portions of
          the program's run time, and to get a rough idea of what those
          costs might be.

          With this in mind, the built-in cost table does not make an
          attempt to provide detailed costs for all events.  Some events
          provide summary or redundant information.  These events are
          assigned "minimum" and "typical" costs of 0 so that they sort to
          the bottom of the output.
          The "maximum" costs are set to 1 cycle so that one can get an
          indication of the time corresponding to these events.  "Issued
          instructions" and "graduated instructions" are examples of such
          events.  In addition to these summary or redundant events,
          detailed cost information has not been provided for a few other
          events such as "external interventions" and "external
          invalidations" since it is difficult to assign costs to these
          asynchronous events.  The built-in cost values may be overridden
          by user-supplied values using the -c option below.

          In addition to the event counts and cost estimates, perfex -y also
          reports a number of statistics derived from the typical costs.
          The meaning of many of the statistics is self-evident, for
          example, graduated instructions/cycle.  Below are listed those
          statistics whose definitions require more explanation:

          Data mispredict/Data scache hits
               This is the ratio of the counts for "Data misprediction from
               scache way prediction table" and "Secondary data cache
               misses."

          Instruction mispredict/Instruction scache hits
               This is the ratio of the counts for "Instruction
               misprediction from scache way prediction table" and
               "Secondary instruction cache misses."

          L1 Cache Line Reuse
               This is the number of times, on average, that a primary data
               cache line is used after it has been moved into the cache.
               It is calculated as "graduated loads" plus "graduated stores"
               minus "primary data cache misses," all divided by "primary
               data cache misses."

          L2 Cache Line Reuse
               This is the number of times, on average, that a secondary
               data cache line is used after it has been moved into the
               cache.  It is calculated as "primary data cache misses" minus
               "secondary data cache misses," all divided by "secondary data
               cache misses."

          L1 Data Cache Hit Rate
               This is the fraction of data accesses which are satisfied
               from a cache line already resident in the primary data cache.
               It is calculated as 1.0 - ("primary data cache misses"
               divided by the sum of "graduated loads" and "graduated
               stores").

          L2 Data Cache Hit Rate
               This is the fraction of data accesses which are satisfied
               from a cache line already resident in the secondary data
               cache.  It is calculated as 1.0 - ("secondary data cache
               misses" divided by "primary data cache misses").

          Time accessing memory/Total time
               This is the sum of the typical costs of "graduated loads,"
               "graduated stores," "primary data cache misses," "secondary
               data cache misses," and "TLB misses," divided by the total
               program run time.  The total program run time is calculated
               by multiplying "cycles" by the time per cycle (inverse of the
               processor's clock speed).

          L1-L2 bandwidth used (MB/s, average per process)
               This is the amount of data moved between the primary and
               secondary data caches, divided by the total program run time.
               The amount of data moved is calculated as the sum of the
               number of "primary data cache misses" multiplied by the
               primary cache line size and the number of "quadwords written
               back from primary data cache" multiplied by the size of a
               quadword (16 bytes).  For multiprocess programs, the
               resulting figure is a per-process average since the counts
               measured by perfex are aggregates of the counts for all the
               threads.  One needs to multiply by the number of threads to
               get the total program bandwidth.

          Memory bandwidth used (MB/s, average per process)
               This is the amount of data moved between the secondary data
               cache and main memory, divided by the total program run time.
               The amount of data moved is calculated as the sum of the
               number of "secondary data cache misses" multiplied by the
               secondary cache line size and the number of "quadwords
               written back from secondary data cache" multiplied by the
               size of a quadword (16 bytes).  For multiprocess programs,
               the resulting figure is a per-process average since the
               counts measured by perfex are aggregates of the counts for
               all the threads.  One needs to multiply by the number of
               threads to get the total program bandwidth.

          MFLOPS (average per process)
               This is the ratio of the "graduated floating point
               instructions" and the total program run time.  Note that
               while a multiply-add carries out two floating point
               operations, it only counts as one instruction, so this
               statistic may underestimate the number of floating point
               operations per second.  For multiprocess programs, the
               resulting figure is a per-process average since the counts
               measured by perfex are aggregates of the counts for all the
               threads.  One needs to multiply by the number of threads to
               get the total program rate.

          A statistic is only printed if counts for the events which define
          it have been gathered.

     -c <file>
          Load a cost table from <file> (requires -y).  This option allows
          one to override the internal event costs used by the -y option.
          <file> contains the list of event costs which are to be
          overridden.  This <file> needs to be in the same format as the
          output produced by the -t option.  Costs may be specified in units
          of "clks" (machine cycles) or "nsec" (nanoseconds).  One may
          override all or only a subset of the default costs.

          One may also use the file /etc/perfex.costs to override event
          costs.  If this file exists, any costs listed in it will override
          those built into perfex.  Costs supplied via the -c option will
          override those provided by the /etc/perfex.costs file.
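
     The override order described for the -c option (built-in costs, then
     /etc/perfex.costs, then the -c file) can be sketched as a layered
     merge.  The dictionaries below are stand-ins, not the cost file format,
     which this page does not show; every cost value, the clock speed, and
     the function names are made up for illustration.

```python
# Per-event "typical" costs in machine cycles (illustrative values only).
builtin = {
    "TLB misses": 50.0,
    "Mispredicted branches": 5.0,
    "Primary data cache misses": 9.0,
}
etc_costs = {"TLB misses": 60.0, "Mispredicted branches": 6.0}
cli_costs = {"Mispredicted branches": 4.0}


def effective_costs(builtin, etc_costs=None, cli_costs=None):
    """Apply overrides in order; later layers win, unlisted events keep
    their earlier value."""
    costs = dict(builtin)
    costs.update(etc_costs or {})
    costs.update(cli_costs or {})
    return costs


def estimated_seconds(count, cost_cycles, clock_hz):
    """Time estimate for one event: count times cost, converted to seconds."""
    return count * cost_cycles / clock_hz


costs = effective_costs(builtin, etc_costs, cli_costs)
print(costs["TLB misses"])             # 60.0, from the /etc layer
print(costs["Mispredicted branches"])  # 4.0, the -c layer wins
print(estimated_seconds(1_950_000, costs["TLB misses"], 195e6))
```

     Events absent from both override layers, such as the primary data
     cache miss entry here, keep their built-in cost.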

     -t   Print the cost table used for perfex -y cost estimates to stdout.

          These internal costs may be overridden by specifying different
          values in the file /etc/perfex.costs, or by using the -c <file>
          option.  Both <file> and /etc/perfex.costs need to use the format
          provided by the -t option.  It is recommended that one capture
          this output to a file and edit it to create a suitable file for
          /etc/perfex.costs or the -c option.  One does not have to specify
          costs for every event, however.  Lines corresponding to events
          whose values one does not wish to override may simply be deleted
          from the file.

FILES
     /etc/perfex.costs

DEPENDENCIES
     perfex only works on an R10000 system.

     For the -mp option, only binaries linked -shared are currently
     supported.  This is due to a dependency on libperfex.so.

     The options -s and -mp are currently mutually exclusive.

LIMITATIONS
     The signal control interface (-s) can control only the immediate target
     process, not any of its descendants.  This makes it unusable with
     multi-process targets in their parallel regions.

SEE ALSO
     r10k_counters(5), libperfex(3), time(1), timex(1)
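
     As a worked illustration of the statistics defined under the -y option,
     the sketch below applies the stated formulas to a set of made-up
     counts.  The clock speed, the cache line sizes, and the choice of
     10^6 bytes per MB are assumptions supplied as parameters, and the
     function and key names are hypothetical; this is not perfex's own code.

```python
QUADWORD_BYTES = 16  # stated on this page


def derived_stats(c, clock_hz, l1_line_bytes, l2_line_bytes):
    """Compute the derived statistics from a dict of event counts."""
    run_time = c["cycles"] / clock_hz  # cycles times time per cycle
    accesses = c["graduated loads"] + c["graduated stores"]
    l1_miss = c["primary data cache misses"]
    l2_miss = c["secondary data cache misses"]
    l1_l2_bytes = (l1_miss * l1_line_bytes
                   + c["quadwords written back from primary data cache"]
                   * QUADWORD_BYTES)
    mem_bytes = (l2_miss * l2_line_bytes
                 + c["quadwords written back from scache"] * QUADWORD_BYTES)
    return {
        "l1_line_reuse": (accesses - l1_miss) / l1_miss,
        "l2_line_reuse": (l1_miss - l2_miss) / l2_miss,
        "l1_hit_rate": 1.0 - l1_miss / accesses,
        "l2_hit_rate": 1.0 - l2_miss / l1_miss,
        "l1_l2_mb_per_s": l1_l2_bytes / run_time / 1e6,  # assumes MB = 10^6
        "memory_mb_per_s": mem_bytes / run_time / 1e6,
        "mflops": c["graduated floating point instructions"] / run_time / 1e6,
    }


# Made-up counts for a single-process run (illustrative only).
counts = {
    "cycles": 195_000_000,
    "graduated loads": 40_000_000,
    "graduated stores": 10_000_000,
    "graduated floating point instructions": 20_000_000,
    "primary data cache misses": 1_000_000,
    "secondary data cache misses": 100_000,
    "quadwords written back from primary data cache": 200_000,
    "quadwords written back from scache": 50_000,
}
stats = derived_stats(counts, clock_hz=195e6,
                      l1_line_bytes=32, l2_line_bytes=128)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

     For a multiprocess run, the bandwidth and MFLOPS figures computed this
     way are per-process averages, as the statistic definitions explain.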